
Description of Dataset and Preparation

The dataset, which relates to white variants of the Portuguese “Vinho Verde” wine, consists of several physicochemical test values per sample and one sensory output variable. The output value is the median of at least 3 evaluations made by wine experts. Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

Variables info:

  1. Fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily).

  2. Volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste.

  3. Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavour to wines.

  4. Residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 g/L, and wines with more than 45 g/L are considered sweet.

  5. Chlorides: the amount of salt in the wine.

  6. Free sulphur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulphite ion; it prevents microbial growth and the oxidation of wine.

  7. Total sulphur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

  8. Density: the density of wine is close to that of water, depending on the percent alcohol and sugar content.

  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.

  10. Sulphates: a wine additive which can contribute to sulphur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

  11. Alcohol: the percent alcohol content of the wine.

  12. Quality: output variable based on sensory data with values between 0 and 10.

Load Libraries and Data
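
The rendered report does not show its library calls. The sketch below lists the packages that the functions used later appear to come from; the list is an inference from those function names, not the original setup chunk, so treat it as an assumption.

```r
# Packages inferred from the function calls in this report (an assumption;
# the original setup chunk is not shown).
pkgs <- c("ggplot2",      # ggplot(), geom_*()
          "gridExtra",    # grid.arrange()
          "dplyr",        # %>%, group_by(), summarise()
          "formattable",  # formattable(), normalize_bar(), as.datatable()
          "pander",       # pandoc.table()
          "corrplot",     # corrplot()
          "ggalt",        # geom_encircle()
          "memisc",       # mtable()
          "rpart",        # rpart()
          "rpart.plot",   # rpart.plot()
          "plotly")       # plot_ly(), ggplotly()

# Attach what is installed and report what is missing instead of stopping.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}
```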

# Load Data
wines <- read.csv("wineQualityWhites.csv")

# View data
formattable(wines, list(area(col = c(quality)) 
                        ~ normalize_bar("yellow", 0.2))) %>% as.datatable()

Univariate Plots Section

Take a look at the data structure and get a statistical summary.

# Create a new bound form of SO2 variable.
wines$bound.sulfur.dioxide <- wines$total.sulfur.dioxide -
  wines$free.sulfur.dioxide

# Create total acidity.
wines$total.acidity <- wines$fixed.acidity +
  wines$volatile.acidity

# Create a rating variable from quality.
wines$rating <- NA
wines$rating <- ifelse(wines$quality < 5, "Undrinkable",
                       ifelse(wines$quality < 6, "Drinkable",
                              ifelse(wines$quality < 7, "Average",
                                     ifelse(wines$quality < 8, "Good", "Great"))))
wines$rating <- factor(wines$rating)
wines$rating <- ordered(wines$rating, levels = c("Undrinkable", "Drinkable",
                                                 "Average", "Good", "Great"))

# str(wines) # uncomment to see the structure of wines dataset.
pandoc.table(summary(wines))
Statistic   X      fixed.acidity   volatile.acidity   citric.acid   residual.sugar
---------   ----   -------------   ----------------   -----------   --------------
Min.        1      3.800           0.0800             0.0000        0.600
1st Qu.     1225   6.300           0.2100             0.2700        1.700
Median      2450   6.800           0.2600             0.3200        5.200
Mean        2450   6.855           0.2782             0.3342        6.391
3rd Qu.     3674   7.300           0.3200             0.3900        9.900
Max.        4898   14.200          1.1000             1.6600        65.800

Table continues below

Statistic   chlorides   free.sulfur.dioxide   total.sulfur.dioxide   density
---------   ---------   -------------------   --------------------   -------
Min.        0.00900     2.00                  9.0                    0.9871
1st Qu.     0.03600     23.00                 108.0                  0.9917
Median      0.04300     34.00                 134.0                  0.9937
Mean        0.04577     35.31                 138.4                  0.9940
3rd Qu.     0.05000     46.00                 167.0                  0.9961
Max.        0.34600     289.00                440.0                  1.0390

Table continues below

Statistic   pH      sulphates   alcohol   quality   bound.sulfur.dioxide
---------   -----   ---------   -------   -------   --------------------
Min.        2.720   0.2200      8.00      3.000     4.0
1st Qu.     3.090   0.4100      9.50      5.000     78.0
Median      3.180   0.4700      10.40     6.000     100.0
Mean        3.188   0.4898      10.51     5.878     103.1
3rd Qu.     3.280   0.5500      11.40     6.000     125.0
Max.        3.820   1.0800      14.20     9.000     331.0

Table continues below

Statistic   total.acidity
---------   -------------
Min.        4.110
1st Qu.     6.570
Median      7.070
Mean        7.133
3rd Qu.     7.590
Max.        14.470

rating        Count
-----------   -----
Undrinkable   183
Drinkable     1457
Average       2198
Good          880
Great         180

The white wines dataset includes 12 variables and almost 5000 observations. Residual sugar concentration is low for most of the wines: the maximum value is about 66 g/L, but the 3rd quartile is only 9.9 g/L. Since wines are considered sweet only above 45 g/L, there are very few sweet white wines in the list.

For most wines, free sulphur dioxide is below 50 ppm. pH values range from 2.72 to 3.82 and alcohol content ranges from 8 to 14.2%. Quality ranges only from 3 to 9 (there are no values of 0, 1, 2 or 10), with a median of 6.

# Create a function for plotting.
f_plot1 <- function(x) {
  ggplot(data = wines, aes_string(x = x))
}

p1 <- f_plot1("fixed.acidity")  + 
  geom_histogram(binwidth = 0.2)

p2 <- f_plot1("volatile.acidity")  + 
  geom_histogram(binwidth = 0.02)

p3 <- f_plot1("total.acidity")  + 
  geom_histogram(binwidth = 0.2)

p4 <- f_plot1("citric.acid")  + 
  geom_histogram(binwidth = 0.02)

p5 <- f_plot1("residual.sugar")  + 
  geom_histogram(binwidth = 0.06) +
  scale_x_log10()

p6 <- f_plot1("chlorides")  + 
  geom_histogram(binwidth = 0.002)

p7 <- f_plot1("free.sulfur.dioxide")  + 
  geom_histogram(binwidth = 0.1) +
  scale_x_log10()

p8 <- f_plot1("total.sulfur.dioxide")  + 
  geom_histogram(binwidth = 0.05) +
  scale_x_log10()

p9 <- f_plot1("bound.sulfur.dioxide")  + 
  geom_histogram(binwidth = 0.06) +
  scale_x_log10()

p10 <- f_plot1("density")  + 
  geom_histogram(binwidth = 0.0004)

p11 <- f_plot1("pH")  + 
  geom_histogram(binwidth = 0.04)

p12 <- f_plot1("sulphates")  + 
  geom_histogram(binwidth = 0.03)

p13 <- f_plot1("alcohol")  + 
  geom_histogram(binwidth = 0.2)

p14 <- f_plot1("quality")  + 
  geom_histogram(binwidth = 1)

grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, p9,
             p10, p11, p12, p13, p14, ncol = 3)

The above histograms were adjusted and the long tail data were transformed to the log10 scale to better understand the distributions.
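
On a log10 axis, `binwidth` is expressed in log10 units, so every bin spans the same ratio rather than the same number of g/L. The short sketch below works this out for the `binwidth = 0.06` used for residual sugar; the edges it prints start from the observed minimum of 0.6 g/L for illustration and are not necessarily ggplot2's exact bin boundaries.

```r
binwidth_log10 <- 0.06          # value used for residual.sugar above
ratio <- 10^binwidth_log10      # upper edge / lower edge of every bin

# Illustrative bin edges in g/L, starting from the observed minimum.
edges <- 0.6 * ratio^(0:4)

print(round(ratio, 3))
print(round(edges, 3))
```

Each bin's upper edge is therefore about 15% larger than its lower edge, which is why bins look equally wide on the log axis but cover ever wider ranges in g/L.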

All of the data were used and no outliers were removed. Since these data were derived from chemical tests, it is assumed here that at least 2 measurements were conducted for each observation in order to handle experimental errors. An outlier may be due to variability in the measurement or it may indicate experimental error; only the latter is sometimes excluded from a dataset.
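
Although no observations were removed here, extreme values can still be flagged for inspection. The sketch below applies the common 1.5 x IQR rule to a handful of made-up values; it is an illustration of one possible check, not a step the analysis above performed.

```r
# Made-up residual sugar values (g/L) with one extreme point.
x <- c(1.2, 1.6, 5.2, 6.4, 9.9, 12.0, 65.8)

# Fences at 1.5 * IQR below the 1st quartile and above the 3rd quartile.
q     <- quantile(x, c(0.25, 0.75))
fence <- q + c(-1.5, 1.5) * IQR(x)

# Values outside the fences are flagged, not deleted.
flagged <- x[x < fence[1] | x > fence[2]]
print(flagged)
```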

All the distributions appear roughly normal, some only after the log10 transformation, except for residual sugar, which is bimodal. There are a few zero values for citric acid. Most wines are of grade 6, or ‘Average’ quality on the newly created rating scale.

p1 <- ggplot(data = wines, aes(x = rating)) + 
  geom_bar()

p2 <- ggplot(data = wines, aes(x = quality)) +
  geom_density(fill = "orangered", alpha = 0.5) +
  scale_x_continuous(breaks = seq(3, 9, 1))

grid.arrange(p1, p2, ncol = 1)

Wines with high residual sugar concentrations probably have a higher density, as sugars contribute to density. Checking sweetness,

by(wines$density, wines$residual.sugar > 45, summary)
## wines$residual.sugar > 45: FALSE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0103 
## -------------------------------------------------------- 
## wines$residual.sugar > 45: TRUE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.039   1.039   1.039   1.039   1.039   1.039
table(wines$residual.sugar > 45)
## 
## FALSE  TRUE 
##  4897     1

it is found that there is only one sweet wine, which also has the highest observed density in the list.

Univariate Analysis

What is the structure of your dataset?

  • The white wines dataset consists of 4898 observations with 11 numerical variables and one variable, quality, which can be considered as categorical.
  • Zero numerical values were observed only for the citric acid variable.
  • Residual sugar is the only variable which follows a bimodal distribution. This clearly shows that there are two groups of wines:

    • Dry wines with low sugars below 5 g/L
    • Semi or Off-Dry wines with sugars higher than 5 g/L (probably late harvest wines)

What is/are the main feature(s) of interest in your dataset?

Although quality is considered the most important feature, the relationships between all of the variables will be further studied.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

It is known that pH, acidity, sugar content, SO2 concentrations and salinity (chlorides) play an important role in the taste, flavours, aromas, structure, colour and ageability of wines.

Did you create any new variables from existing variables in the dataset?

Three new variables were created:

  • Bound sulphur dioxide which is the difference between total and free SO2.
  • Total acidity, the sum of fixed and volatile acidity, since its overall value determines how a wine will taste, how it feels in the mouth and how well it will age.
  • A new rating categorical variable based on the existing quality grade.
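
The three derived variables can be sketched on a toy data frame. The rows below are made up, and `cut()` is used here as a compact alternative to the nested `ifelse()` calls; it gives the same labels only because quality scores are integers.

```r
# Toy rows with made-up values; column names follow the wines data frame.
toy <- data.frame(
  fixed.acidity        = c(7.0, 6.3, 8.1),
  volatile.acidity     = c(0.27, 0.30, 0.28),
  free.sulfur.dioxide  = c(45, 14, 30),
  total.sulfur.dioxide = c(170, 132, 97),
  quality              = c(4, 6, 8)
)

toy$bound.sulfur.dioxide <- toy$total.sulfur.dioxide - toy$free.sulfur.dioxide
toy$total.acidity        <- toy$fixed.acidity + toy$volatile.acidity

# Right-closed intervals reproduce the thresholds <5, 5, 6, 7 and >=8
# for integer quality scores.
toy$rating <- cut(toy$quality,
                  breaks = c(-Inf, 4, 5, 6, 7, Inf),
                  labels = c("Undrinkable", "Drinkable", "Average",
                             "Good", "Great"),
                  ordered_result = TRUE)

print(toy[, c("quality", "rating", "bound.sulfur.dioxide", "total.acidity")])
```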

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

  • Residual sugar concentration was the only bimodal distribution which clearly shows two types of white wines in the list.
  • A new rating quality parameter was created as an ordered factor variable to facilitate and improve the exploratory analysis.
  • Histograms’ bin size was tuned and in some cases the x scale was transformed to log10 in order to reveal new patterns in the data.

Bivariate Plots Section

Explore the correlations between the variables.

wines_corr_r <- round(cor(wines[c(2:13)]), 1)
par(xpd=TRUE)
corrplot(wines_corr_r, method = "number", mar = c(2, 0, 1, 0))
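
Besides reading the plot, the strongest pairs can be listed directly from a correlation matrix. The sketch below does this on synthetic stand-in data (not the wine measurements); the variable names and coefficients are assumptions chosen only to produce correlated columns.

```r
set.seed(1)
n <- 200
# Synthetic stand-in data (not the wine measurements).
alcohol <- rnorm(n, 10.5, 1.2)
sugar   <- rexp(n, 1 / 6)
density <- 1.0 - 0.002 * (alcohol - 10.5) + 0.0004 * sugar +
  rnorm(n, 0, 0.0005)
toy <- data.frame(alcohol, sugar, density)

m <- cor(toy)
# Keep each pair once by blanking the diagonal and the lower triangle.
m[lower.tri(m, diag = TRUE)] <- NA
pairs_df <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
pairs_df <- pairs_df[!is.na(pairs_df$Freq), ]
names(pairs_df) <- c("var1", "var2", "r")

# Sort by absolute correlation, strongest first.
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
print(pairs_df, row.names = FALSE)
```

Working from the unrounded matrix avoids ties that rounding to one decimal place would introduce.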

Several interesting correlations were found:

Examine the above relationships further by visualising them with scatterplots and encircling the wines with a quality of 9 (the highest grade observed).

p1 <- ggplot(data = wines, aes(x = alcohol, y = rating)) +
  geom_jitter(alpha = 0.1)

# `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0).
p2 <- ggplot(data = wines, aes(x = round(alcohol / 0.5) * 0.5, y = quality)) +
  geom_line(stat = "summary", fun = mean)

p3 <- ggplot(data = wines, aes(x = residual.sugar, y = density)) +
  geom_jitter(alpha = 0.05) +
  xlim(0, 20) +
  ylim(0.98, 1.01) +
  geom_smooth(method = "lm", color = "orange") +
  geom_encircle(aes(x = residual.sugar, y = density),
                data = wines[wines$quality >= 9, ],
                color = "red")

p4 <- ggplot(data = wines, aes(x = alcohol, y = density)) +
  geom_jitter(alpha = 0.05) +
  ylim(0.98, 1.01) +
  geom_smooth(method = "lm", color = "orange") +
  geom_encircle(aes(x = alcohol, y = density),
                data = wines[wines$quality >= 9, ],
                color = "red")

grid.arrange(p1, p2, p3, p4, ncol = 2)

At first glance, no visual trend was revealed for the quality-alcohol pair, even after applying jitter and changing transparency to prevent overplotting. However, the plot of alcohol vs. mean quality shows an increase in quality with alcohol content. The density-sugars pair exhibits a positive linear relationship and the density-alcohol pair a negative one.
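
The mean-quality line relies on snapping alcohol values to the nearest 0.5 before averaging. A minimal sketch of that binning step, using made-up values rather than the wine data:

```r
# Made-up alcohol and quality values for illustration.
alcohol <- c(9.2, 9.3, 9.6, 10.1, 10.4, 11.8, 12.1)
quality <- c(5,   5,   6,   6,    5,    7,    7)

# round(x / 0.5) * 0.5 snaps each value to the nearest multiple of 0.5.
alcohol_bin <- round(alcohol / 0.5) * 0.5

# Mean quality within each alcohol bin.
mean_quality <- tapply(quality, alcohol_bin, mean)
print(mean_quality)
```

Averaging within bins removes the vertical scatter of the integer quality scores, which is why the trend is visible in the line plot but not in the jittered points.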

p1 <- ggplot(data = wines, aes(x = chlorides, y = density)) +
  geom_jitter(alpha = 0.1) +
  xlim(0.01, 0.07) +
  ylim(0.985, 1.005) +
  geom_smooth(method = "lm", color = "orange")

p2 <- ggplot(data = wines, aes(x = total.sulfur.dioxide, y = density)) +
  geom_jitter(alpha = 0.1) +
  xlim(0, 300) +
  ylim(0.985, 1.005) +
  geom_smooth(method = "lm", color = "orange")

p3 <- ggplot(data = wines, aes(x = residual.sugar, y = alcohol)) +
  geom_jitter(alpha = 0.1) +
  xlim(0, 20) +
  geom_smooth(method = "loess", color = "orange")

p4 <- ggplot(data = wines, aes(x = pH, y = fixed.acidity)) +
  geom_jitter(alpha = 0.1) +
  scale_y_log10() +
  geom_smooth(method = "lm", color = "orange")

p5 <- ggplot(data = wines, aes(x = alcohol, y = chlorides)) +
  geom_jitter(alpha = 0.1) +
  ylim(0, 0.1) +
  geom_smooth(method = "lm", color = "orange")

p6 <- ggplot(data = wines, aes(x = alcohol, y = total.sulfur.dioxide)) +
  geom_jitter(alpha = 0.1) +
  ylim(0, 300) +
  geom_smooth(method = "lm", color = "orange")

grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 2)

Some interesting observations can be made, which are in agreement with literature:

# Create a function for plotting a variable against rating.
f_plot2 <- function(y) {
  ggplot(wines, aes_string(x = "rating", y = y, fill = "rating")) +
  geom_boxplot(alpha = 0.5) +
  # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0).
  stat_summary(fun = mean, geom = "point", shape = 8, size = 3) +
  scale_fill_brewer(palette = "Pastel2") +
  theme(legend.position = "none") +
  theme(axis.title.x = element_blank())
}

p1 <- f_plot2(y = "total.acidity") + ylim(5, 10)

p2 <- f_plot2(y = "citric.acid") + ylim(0, 0.75)

p3 <- f_plot2(y = "residual.sugar") + ylim(0, 20)

p4 <- f_plot2(y = "chlorides") + ylim(0, 0.1)

p5 <- f_plot2(y = "free.sulfur.dioxide") + ylim(0, 100)

p6 <- f_plot2(y = "total.sulfur.dioxide") + ylim(0, 300)

p7 <- f_plot2(y = "density") + ylim(0.99, 1)

p8 <- f_plot2(y = "pH")

p9 <- f_plot2(y = "sulphates") + ylim(0.2, 1)

p10 <- f_plot2(y = "alcohol")

grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, ncol = 2)

It seems that:

ggplot(data = wines, aes(x = total.acidity, y = quality)) +
  geom_jitter(alpha = 0.2) +
  xlim(5, 10)

ggplot(data = wines, aes(x = chlorides, y = quality)) +
  geom_jitter(alpha = 0.2) +
  xlim(0, 0.1)

ggplot(data = wines, aes(x = free.sulfur.dioxide, y = quality)) +
  geom_jitter(alpha = 0.2) +
  xlim(0, 100)

Unfortunately, the scatterplots of the above variable-quality pairs reveal no visual trends.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The main feature of interest, quality, seems to vary with alcohol, density, SO2, sugars and chlorides. Generally, a “bad” wine in this list has higher acidity, chlorides and density, while its SO2 levels are lower.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It is worth noting the weak negative correlation of SO2-alcohol. A better understanding of the chemistry of SO2 can explain a competitive relation between SO2 and alcohol.

Alcohol is produced by the Saccharomyces yeast strains during alcoholic fermentation. Sulphur dioxide (SO2) is a natural by-product of winemaking as a small quantity is produced during the alcoholic fermentation by yeasts.

In practice, however, SO2 is added by winemakers either as a preservative or prior to alcoholic fermentation to control the growth of microorganisms, which stops the production of alcohol. This benefits the flavours of white wines, since the enzyme polyphenol oxidase is inhibited and less oxidative browning of the juice occurs. This helps to preserve the fruity and floral aromas found in the juice.

What was the strongest relationship you found?

The strongest relationships were between density and sugars or alcohol:

  • sugars in wine tend to increase density, and
  • alcohol tends to decrease density.

Multivariate Plots Section

Classify wines into dry and off-dry types based on the residual sugar content. Create a new data frame that contains the mean and median alcohol content for each combination of quality and wine type.

# Create a type variable from residual.sugar.
wines$type <- NA
wines$type <- ifelse(wines$residual.sugar < 5,
                     "dry", "off_dry")

wines.alcohol_by_type <- wines %>% 
  group_by(quality, type) %>% 
  summarise(mean_alcohol = mean(alcohol),
            median_alcohol = median(alcohol),
            n = n()) %>% 
  arrange(quality)

ggplot(data = wines.alcohol_by_type, 
       aes(x = quality, y = mean_alcohol)) +
  geom_line(aes(color = type)) +
  scale_x_continuous(breaks = seq(3, 9, 1))

There is an increasing trend in quality with alcohol, which is more pronounced for the dry wines, but more data are needed to draw robust conclusions.

f_plot3 <- function(x, y) {
  ggplot(data = wines, 
  aes_string(x = x, y = y, color = "rating")) +
  geom_smooth(method = "lm", alpha = 0.2, size = 0.5)
}

p1 <- f_plot3(x = "residual.sugar", y = "density") +
  xlim(0, 25)

p2 <- f_plot3(x = "alcohol", y = "density")

p3 <- f_plot3(x = "chlorides", y = "density")

p4 <- f_plot3(x = "total.sulfur.dioxide", y = "density")

p5 <- f_plot3(x = "residual.sugar", y = "alcohol")

p6 <- f_plot3(x = "pH", y = "fixed.acidity")

p7 <- f_plot3(x = "alcohol", y = "chlorides")

p8 <- f_plot3(x = "alcohol", y = "total.sulfur.dioxide")

grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol = 2)

There is nothing unexpected in the above plots except for a different visual behaviour between “low” and “high” quality wines. For example, “good” and “great” wines show a higher increase of density than “average” wines.

f_plot4 <- function(x, y) {
  ggplot(data = wines, 
  aes_string(x = x, y = y)) +
  geom_jitter(aes(color = rating), alpha = 0.5) +
  scale_color_brewer(type = "qual")
}

p1 <- f_plot4(x = "residual.sugar", y = "density") +
  xlim(0, 25) +
  ylim(0.985, 1.005)

p2 <- f_plot4(x = "alcohol", y = "density") +
  ylim(0.985, 1.005)

p3 <- f_plot4(x = "chlorides", y = "density") +
  xlim(0, 0.2) +
  ylim(0.985, 1.005)

p4 <- f_plot4(x = "total.sulfur.dioxide", y = "density") +
  xlim(0, 300) +
  ylim(0.985, 1.005)

p5 <- f_plot4(x = "residual.sugar", y = "alcohol") +
  xlim(0, 22)

p6 <- f_plot4(x = "pH", y = "fixed.acidity") +
  ylim(5, 11)

p7 <- f_plot4(x = "alcohol", y = "chlorides") +
  ylim(0, 0.15)

p8 <- f_plot4(x = "alcohol", y = "total.sulfur.dioxide") +
  ylim(0, 300)

grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol = 2)

It is very difficult to find patterns, but generally, it seems that lower quality wines have lower alcohol and higher chlorides.

A first attempt to build a linear model and use the variables in the linear model to predict the quality of a wine was not very successful, since the biggest R-squared value is only 0.282.

m1 <- lm(quality ~ alcohol, data = wines)
m2 <- update(m1, ~ . + density)
m3 <- update(m2, ~ . + volatile.acidity)
m4 <- update(m3, ~ . + chlorides)
m5 <- update(m4, ~ . + total.sulfur.dioxide)
m6 <- update(m5, ~ . + fixed.acidity)
m7 <- update(m6, ~ . + residual.sugar)
m8 <- update(m7, ~ . + pH)
m9 <- update(m8, ~ . + sulphates)
m10 <- update(m9, ~ . + free.sulfur.dioxide)
m11 <- update(m10, ~ . + citric.acid)
mtable(m1, m2, m3, m4, m5, m6, m7, m8, m9, m10, m11, sdigits = 3)
## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wines)
## m2: lm(formula = quality ~ alcohol + density, data = wines)
## m3: lm(formula = quality ~ alcohol + density + volatile.acidity, 
##     data = wines)
## m4: lm(formula = quality ~ alcohol + density + volatile.acidity + 
##     chlorides, data = wines)
## m5: lm(formula = quality ~ alcohol + density + volatile.acidity + 
##     chlorides + total.sulfur.dioxide, data = wines)
## m6: lm(formula = quality ~ alcohol + density + volatile.acidity + 
##     chlorides + total.sulfur.dioxide + fixed.acidity, data = wines)
## m7: lm(formula = quality ~ alcohol + density + volatile.acidity + 
##     chlorides + total.sulfur.dioxide + fixed.acidity + residual.sugar, 
##     data = wines)
## m8: lm(formula = quality ~ alcohol + density + volatile.acidity + 
##     chlorides + total.sulfur.dioxide + fixed.acidity + residual.sugar + 
##     pH, data = wines)
## m9: lm(formula = quality ~ alcohol + density + volatile.acidity + 
##     chlorides + total.sulfur.dioxide + fixed.acidity + residual.sugar + 
##     pH + sulphates, data = wines)
## m10: lm(formula = quality ~ alcohol + density + volatile.acidity + 
##     chlorides + total.sulfur.dioxide + fixed.acidity + residual.sugar + 
##     pH + sulphates + free.sulfur.dioxide, data = wines)
## m11: lm(formula = quality ~ alcohol + density + volatile.acidity + 
##     chlorides + total.sulfur.dioxide + fixed.acidity + residual.sugar + 
##     pH + sulphates + free.sulfur.dioxide + citric.acid, data = wines)
## 
## ===============================================================================================================================================================
##                            m1          m2          m3          m4          m5          m6          m7          m8           m9           m10          m11      
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)            2.582***  -22.492***  -36.499***  -35.573***  -30.759***  -43.308***   60.251***   130.584***   162.786***   149.901***   150.193***  
##                         (0.098)     (6.165)     (6.001)     (6.010)     (6.295)     (6.493)    (14.109)     (17.934)     (18.569)     (18.760)     (18.804)    
##   alcohol                0.313***    0.360***    0.399***    0.389***    0.391***    0.407***    0.305***     0.222***     0.184***     0.194***     0.193***  
##                         (0.009)     (0.015)     (0.014)     (0.015)     (0.015)     (0.015)     (0.019)      (0.023)      (0.024)      (0.024)      (0.024)    
##   density                           24.728***   38.992***   38.217***   33.251***   46.423***  -57.411***  -130.265***  -162.939***  -149.987***  -150.284***  
##                                     (6.079)     (5.920)     (5.926)     (6.234)     (6.458)    (14.123)     (18.195)     (18.839)     (19.029)     (19.075)    
##   volatile.acidity                              -2.072***   -2.043***   -2.070***   -2.108***   -2.094***    -2.021***    -1.966***    -1.868***    -1.863***  
##                                                 (0.110)     (0.111)     (0.111)     (0.111)     (0.110)      (0.110)      (0.110)      (0.112)      (0.114)    
##   chlorides                                                 -1.300*     -1.370*     -1.383*     -0.858       -0.267       -0.153       -0.234       -0.247     
##                                                             (0.542)     (0.543)     (0.540)     (0.540)      (0.546)      (0.544)      (0.543)      (0.547)    
##   total.sulfur.dioxide                                                   0.001*      0.001*      0.001**      0.001**      0.001*      -0.000       -0.000     
##                                                                         (0.000)     (0.000)     (0.000)      (0.000)      (0.000)      (0.000)      (0.000)    
##   fixed.acidity                                                                     -0.099***   -0.045**      0.044*       0.066**      0.066**      0.066**   
##                                                                                     (0.014)     (0.015)      (0.021)      (0.021)      (0.021)      (0.021)    
##   residual.sugar                                                                                 0.045***     0.075***     0.087***     0.081***     0.081***  
##                                                                                                 (0.005)      (0.007)      (0.007)      (0.008)      (0.008)    
##   pH                                                                                                          0.665***     0.707***     0.684***     0.686***  
##                                                                                                              (0.105)      (0.105)      (0.105)      (0.105)    
##   sulphates                                                                                                                0.638***     0.632***     0.631***  
##                                                                                                                           (0.100)      (0.100)      (0.100)    
##   free.sulfur.dioxide                                                                                                                   0.004***     0.004***  
##                                                                                                                                        (0.001)      (0.001)    
##   citric.acid                                                                                                                                        0.022     
##                                                                                                                                                     (0.096)    
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                 0.190      0.192       0.247       0.248       0.249       0.257       0.267        0.273        0.279        0.282        0.282   
##   adj. R-squared            0.190      0.192       0.246       0.247       0.248       0.256       0.266        0.272        0.278        0.280        0.280   
##   sigma                     0.797      0.796       0.769       0.768       0.768       0.764       0.759        0.756        0.753        0.751        0.751   
##   F                      1146.395    583.290     534.843     402.956     324.034     281.812     254.596      229.523      210.136      191.810      174.344   
##   p                         0.000      0.000       0.000       0.000       0.000       0.000       0.000        0.000        0.000        0.000        0.000   
##   Log-likelihood        -5839.391  -5831.127   -5660.164   -5657.292   -5654.027   -5627.454   -5593.583    -5573.700    -5553.598    -5543.767    -5543.740   
##   Deviance               3112.257   3101.773    2892.625    2889.234    2885.385    2854.246    2815.042     2792.280     2769.454     2758.359     2758.329   
##   AIC                   11684.782  11670.255   11330.329   11326.584   11322.054   11270.908   11205.165    11167.399    11129.197    11111.534    11113.480   
##   BIC                   11704.272  11696.241   11362.812   11365.563   11367.530   11322.880   11263.634    11232.365    11200.659    11189.493    11197.936   
##   N                      4898       4898        4898        4898        4898        4898        4898         4898         4898         4898         4898       
## ===============================================================================================================================================================
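
In the table above, the density coefficient changes sign once residual.sugar enters the model (m6 to m7), which is often a symptom of correlated predictors. One way to check for this is a variance inflation factor (VIF). The sketch below computes VIFs by hand on synthetic predictors, not on the wine data, so the numbers it prints say nothing about the actual models.

```r
set.seed(42)
n <- 500
# Synthetic predictors (not the wine data): density is built from the other two.
alcohol <- rnorm(n, 10.5, 1.2)
sugar   <- rexp(n, 1 / 6)
density <- 1.0 - 0.002 * (alcohol - 10.5) + 0.0004 * sugar +
  rnorm(n, 0, 0.0003)
X <- data.frame(alcohol, sugar, density)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all the other predictors.
vif <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), response = v),
                   data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif, 1))
```

A VIF far above 1 means the predictor is largely explained by the others, so its coefficient can swing in size and sign as terms are added.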

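The second attempt was to build a decision tree model, since it is more robust, handles nonlinearity and performs well with both numerical and categorical data. Two versions of the wine dataset were examined: one with the original numerical variables, and one in which free SO2, chlorides, volatile acidity and citric acid were recoded as two-level (high/low) factors.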

fit <- rpart(rating ~ alcohol + density + volatile.acidity + chlorides +
               total.sulfur.dioxide + fixed.acidity + residual.sugar +
               pH + sulphates + free.sulfur.dioxide + citric.acid,
               data = wines,
               method="class")

rpart.plot(fit)

# Create new categorical variables with high-low levels.
wines$free.sulfur.dioxide_hilo <- NA
wines$free.sulfur.dioxide_hilo <- ifelse(wines$free.sulfur.dioxide < 50, 0, 1)
wines$free.sulfur.dioxide_hilo <- factor(wines$free.sulfur.dioxide_hilo)

wines$chlorides_hilo <- NA
wines$chlorides_hilo <- ifelse(wines$chlorides < 0.06, 0, 1)
wines$chlorides_hilo <- factor(wines$chlorides_hilo)

wines$volatile.acidity_hilo <- NA
wines$volatile.acidity_hilo <- ifelse(wines$volatile.acidity < 0.26, 0, 1)
wines$volatile.acidity_hilo <- factor(wines$volatile.acidity_hilo)

wines$citric.acid_hilo <- NA
wines$citric.acid_hilo <- ifelse(wines$citric.acid > 0, 1, 0)
wines$citric.acid_hilo <- factor(wines$citric.acid_hilo)

fit <- rpart(rating ~ alcohol +
               density +
               volatile.acidity_hilo +
               chlorides_hilo +
               free.sulfur.dioxide_hilo +
               fixed.acidity +
               residual.sugar +
               pH +
               sulphates +
               total.sulfur.dioxide +
               citric.acid_hilo,
               data = wines,
               method="class")

rpart.plot(fit)

A few of the variables seem to play a significant role in quality, such as alcohol, volatile acidity, free SO2 and sugars.
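
The trees above are only inspected visually. A confusion matrix on held-out rows is one way to put a number on how well such a tree classifies. The sketch below uses synthetic data and a three-level rating invented for the example; the 70/30 split is also an assumption, since the report fits its trees on all rows.

```r
library(rpart)  # rpart ships with standard R installations

set.seed(7)
n <- 600
# Synthetic stand-in for the wine data; the rating depends mainly on alcohol.
toy <- data.frame(alcohol = runif(n, 8, 14),
                  volatile.acidity = runif(n, 0.1, 0.6))
score <- toy$alcohol - 4 * toy$volatile.acidity + rnorm(n, 0, 1)
toy$rating <- cut(score, breaks = quantile(score, c(0, 1/3, 2/3, 1)),
                  labels = c("Low", "Mid", "High"), include.lowest = TRUE)

# Hold out 30% of the rows so accuracy is not measured on the training data.
test_idx <- sample(n, size = round(0.3 * n))
fit_toy  <- rpart(rating ~ alcohol + volatile.acidity,
                  data = toy[-test_idx, ], method = "class")
pred <- predict(fit_toy, newdata = toy[test_idx, ], type = "class")

# Rows are predicted classes, columns are the true classes.
confusion <- table(predicted = pred, actual = toy$rating[test_idx])
accuracy  <- sum(diag(confusion)) / sum(confusion)
print(confusion)
print(round(accuracy, 3))
```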

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Several interesting relations were found, and most of them could be explained by reviewing the literature. Some of the most important correlation pairs are:

  • Density vs sugars, alcohol and chlorides
  • Quality vs alcohol
  • Alcohol vs sugars

Were there any interesting or surprising interactions between features?

The fairly strong positive relationship between quality and alcohol was really surprising, since ethanol is not considered tasty.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Two models, a linear model and a decision tree, were created to predict the quality of a wine, but the results were unsatisfactory. Additional observations or variables are probably needed to build a good prediction model.


Final Plots and Summary

Plot One

p1 <- ggplot(data = wines, aes(x = residual.sugar)) +
  # after_stat() replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0).
  geom_histogram(binwidth = 0.06, colour = "white", alpha = 0.8,
                 aes(y = after_stat(density), fill = after_stat(count))) +
  scale_fill_gradient("Count", low = "yellow", high = "brown") +
  scale_x_log10() +
  geom_density(colour = "orangered") +
  theme_light() +
  labs(title = "Histogram and Density Plot of Residual Sugar in White Wines",
       x = "Residual Sugar in g/L",
       y = "Density")

ggplotly(p1)

Description One

The distribution of residual sugar appears to be bimodal on a log scale, because there are two types of wine: the dry and the off-dry ones.

Plot Two

p2 <- plot_ly(data = wines[wines$density < 1.002, ], 
              x = ~density) %>%
  add_markers(y = ~alcohol,
              name = 'Alcohol',
              marker = list(color = 'rgba(88, 116, 152, 0.3)'),
              hoverinfo = "text",
              text = ~paste(alcohol)) %>%
  add_lines(y = ~fitted(lm(alcohol ~ density)),
            line = list(color = '#E86850'),
            name = "Linear smoother", 
            showlegend = FALSE) %>%
  add_markers(y = ~residual.sugar, 
              name = 'Residual Sugar',
              yaxis = 'y2', 
              hoverinfo = "text", 
              alpha = 0.3, 
              text = ~paste(residual.sugar, 'g/L')) %>%
  add_lines(y = ~fitted(loess(residual.sugar ~ density)),
            line = list(color = '#FFD800'),
            name = "Loess Smoother", 
            showlegend = TRUE) %>%
  layout(title = "How Alcohol and Residual Sugar Relate with Density",
         xaxis = list(title = "Density"),
         yaxis = list(side = 'left', 
                      title = "Percentage of Alcohol Content",
                      showgrid = FALSE, 
                      zeroline = FALSE),
         yaxis2 = list(side = 'right', 
                       overlaying = "y", 
                       title = "Residual Sugar in g/L", 
                       showgrid = FALSE, 
                       zeroline = FALSE), 
         paper_bgcolor = 'rgb(243, 243, 243)',
         plot_bgcolor = 'rgb(243, 243, 243)')

ggplotly(p2)

Description Two

Alcohol tends to decrease the density of a wine, whereas sugar increases it. There appears to be a linear relationship between alcohol and density, with an R-squared of 61%.

summary(lm(I(density) ~ I(alcohol), data = wines))
## 
## Call:
## lm(formula = I(density) ~ I(alcohol), data = wines)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.005475 -0.001238 -0.000153  0.001156  0.047201 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.014e+00  2.300e-04 4407.87   <2e-16 ***
## I(alcohol)  -1.896e-03  2.173e-05  -87.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001871 on 4896 degrees of freedom
## Multiple R-squared:  0.6086, Adjusted R-squared:  0.6085 
## F-statistic:  7613 on 1 and 4896 DF,  p-value: < 2.2e-16
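Plot Two drops wines with density of 1.002 or more, while the model above is fitted on all wines. The sketch below shows one way the same cut-off could be applied before fitting, and checks that for a one-predictor model R-squared equals the squared Pearson correlation. It is an illustration under assumptions: `fit_density_alcohol()` and `max_density` are hypothetical names, and `toy` is synthetic data, not the wines data set.

```r
# Fit density ~ alcohol, optionally dropping high-density outliers.
# `fit_density_alcohol` and `max_density` are hypothetical names for this sketch.
fit_density_alcohol <- function(data, max_density = Inf) {
  kept <- data[data$density < max_density, ]
  fit <- lm(density ~ alcohol, data = kept)
  list(n = nrow(kept),
       slope = unname(coef(fit)["alcohol"]),
       r_squared = summary(fit)$r.squared,
       # For a one-predictor model, R-squared equals the squared correlation
       cor_squared = cor(kept$density, kept$alcohol)^2)
}

# Synthetic stand-in for the wines data frame (assumed values, not real data)
set.seed(1)
toy <- data.frame(alcohol = runif(200, 8, 14))
toy$density <- 1.014 - 0.0019 * toy$alcohol + rnorm(200, sd = 0.0015)
toy$density[1] <- 1.04  # one artificial outlier

all_wines <- fit_density_alcohol(toy)
trimmed   <- fit_density_alcohol(toy, max_density = 1.002)
```

On the real data the call would be `fit_density_alcohol(wines, max_density = 1.002)`. Comparing `all_wines$r_squared` with `trimmed$r_squared` shows how much a few extreme densities affect the fit.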

Plot Three

colors <- c('#4AC6B7', '#1972A4', '#965F8A', '#FF7070', '#C61951')
wines_good <- subset(wines, free.sulfur.dioxide < 50 & 
                            volatile.acidity < 0.26 & 
                            citric.acid > 0 &
                            chlorides < 0.06 &
                            sulphates < 0.47)

p3 <- plot_ly(wines_good, 
              x = ~alcohol, y = ~density, 
              color = ~rating, size = ~quality, colors = colors,
              type = 'scatter', mode = 'markers',
              sizes = c(min(wines$quality), max(wines$quality)) * 2,
              marker = list(symbol = 'circle', sizemode = 'diameter',
                            line = list(width = 2, color = '#FFFFFF')),
              text = ~paste('Quality:', quality, '<br>Density:', density,
                            '<br>Alcohol:', alcohol)) %>%
  layout(title = 'Alcohol vs. Density (Low SO2, Chlorides and Volatile Acidity - Citric Acid Aromas)',
         xaxis = list(title = 'Percentage of Alcohol Content',
                      gridcolor = 'rgb(255, 255, 255)',
                      range = c(8, 14),
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwidth = 2),
         yaxis = list(title = 'Density (g/mL)',
                      gridcolor = 'rgb(255, 255, 255)',
                      range = c(0.985, 1.001),
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwidth = 2),
         paper_bgcolor = 'rgb(243, 243, 243)',
         plot_bgcolor = 'rgb(243, 243, 243)')

# p3 is already a plotly object, so it can be printed directly
p3

Description Three

This plot again shows how density varies with alcohol content, this time only for wines that meet the following criteria:

  • free sulphur dioxide < 50 ppm (above this threshold SO2 becomes evident in the nose and taste of wine),
  • sulphates < 0.47 g/L (they contribute to sulphur dioxide gas levels),
  • chlorides < 0.06 g/L (increased levels of sodium chloride appear to convey undesirable soapy notes),
  • volatile acidity < 0.26 g/L (high levels can lead to an unpleasant, vinegar taste),
  • citric acid > 0 g/L (it adds ‘freshness’ and flavour to wines).

This was an attempt to isolate wines with desirable physicochemical properties, but unfortunately the resulting subset still contains wines of every quality level. Nevertheless, density still decreases as alcohol increases.
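To see how quality is spread inside such a subset, the selection can be wrapped in a small helper and the quality values tabulated. This is a minimal sketch, assuming the column names used in the plotting code; `select_desirable()` is a hypothetical helper, and `toy` is a made-up data frame standing in for `wines`.

```r
# Apply the selection criteria and tabulate quality inside the subset.
# `select_desirable` is a hypothetical helper; thresholds follow the list above.
select_desirable <- function(wines) {
  subset(wines,
         free.sulfur.dioxide < 50 &
           volatile.acidity < 0.26 &
           citric.acid > 0 &
           chlorides < 0.06 &
           sulphates < 0.47)
}

# Tiny made-up data frame standing in for `wines` (not real measurements)
toy <- data.frame(
  free.sulfur.dioxide = c(30, 45, 60, 20, 35),
  volatile.acidity    = c(0.20, 0.25, 0.20, 0.30, 0.22),
  citric.acid         = c(0.30, 0.00, 0.40, 0.30, 0.35),
  chlorides           = c(0.04, 0.05, 0.04, 0.03, 0.05),
  sulphates           = c(0.40, 0.45, 0.40, 0.40, 0.46),
  quality             = c(5, 6, 7, 6, 8)
)

good <- select_desirable(toy)
quality_counts <- table(good$quality)
quality_counts
```

Running `table(select_desirable(wines)$quality)` on the real data would show directly whether the subset still spans the full quality range.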


Reflection

The white wines data set contains information on almost 5000 wines across 12 variables. Several interesting trends and relations were observed during the data exploration. Residual sugar showed a clear bimodal distribution, which indicates two types of wine, dry and off-dry. Most wines had a quality between 5 and 7. As expected, density correlated strongly with both sugar and alcohol, and there was an unusual positive relation between quality and alcohol. Several other relations were investigated but proved less important.

Finally, linear and decision tree methods were applied to build a prediction model of quality using all the variables of the data set. Once again, alcohol was found to have the greatest influence on wine quality. Unfortunately, it was really difficult to find patterns between quality and the other variables, apart from the alcohol-quality trend. This suggests that either the data set is too small or that other, more decisive variables need to be investigated.
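The modelling step described above could look roughly like the following. This is a sketch on synthetic data, not the original models: the `toy` data frame and its assumed alcohol-quality relation are invented for illustration, and the tree is fitted with `rpart`, which is an assumption since the text does not name the package used.

```r
# Rank predictors of quality with a linear model and, if available, a tree.
# Illustrative sketch on synthetic data, not the original models.
set.seed(42)
n <- 300
toy <- data.frame(alcohol = runif(n, 8, 14),
                  chlorides = runif(n, 0.02, 0.08))
# Assumed relation for the toy data: quality rises with alcohol only
toy$quality <- 2 + 0.4 * toy$alcohol + rnorm(n, sd = 0.5)

# Linear model on all variables; compare predictors by absolute t value
lin_fit <- lm(quality ~ ., data = toy)
t_values <- abs(summary(lin_fit)$coefficients[-1, "t value"])
strongest <- names(which.max(t_values))

# Decision tree via rpart (assumed to be installed; skipped otherwise)
if (requireNamespace("rpart", quietly = TRUE)) {
  tree_fit <- rpart::rpart(quality ~ ., data = toy)
  print(tree_fit$variable.importance)
}
```

Comparing absolute t values is only a rough ranking, since it ignores correlation between predictors; the tree's `variable.importance` gives a second, model-specific view of the same question.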

It is really difficult to capture the complexity of a wine with only 11 physicochemical properties; this complexity can come from different things such as:


References